Loading Data from Python into CAS

There are many ways of loading data into CAS. Some methods simply invoke actions in CAS that load data from files on the server. Other methods involve connections to data sources such as databases in CASLibs. The last method is to load data from the client side using a language such as Python.

There are multiple methods of loading data from Python into CAS. Here we will introduce the different ways of getting your data into CAS from the Python CAS client.

As usual, the first thing we need to do is to import SWAT and create a connection to CAS.


In [1]:
import swat

conn = swat.CAS(host, port, username, password)

Now that we have our connection, let's start loading some data.

Loading Data using Pandas-like Parser Methods

It's possible to load data directly from the connection object that we just created. CAS connection objects support the same data readers that the Pandas package does, including read_csv, read_excel, read_json, read_html, and so on. The CAS connection versions of these methods are merely thin wrappers around the Pandas versions: the data is read into a DataFrame and then uploaded to a CAS table behind the scenes. The result is a Python CASTable object. Let's look at reading a CSV file first. Keep in mind that specifying the file with a URL will download the URL to a local file and then upload it to CAS. This won't matter much with a small data set such as this one, but for larger data sets you may want to use a different method of loading data.


In [2]:
tbl = conn.read_csv('https://raw.githubusercontent.com/sassoftware/sas-viya-programming/master/data/cars.csv')
tbl


Out[2]:
CASTable('_T_8B0F9FAB_7FFB1018E0C0', caslib='CASUSERHDFS(kesmit)')

In [3]:
tbl.head()


Out[3]:
Selected Rows from Table _T_8B0F9FAB_7FFB1018E0C0
Make Model Type Origin DriveTrain MSRP Invoice EngineSize Cylinders Horsepower MPG_City MPG_Highway Weight Wheelbase Length
0 Acura MDX SUV Asia All 36945.0 33337.0 3.5 6.0 265.0 17.0 23.0 4451.0 106.0 189.0
1 Acura 3.5 RL 4dr Sedan Asia Front 43755.0 39014.0 3.5 6.0 225.0 18.0 24.0 3880.0 115.0 197.0
2 Audi A41.8T convertible 2dr Sedan Europe Front 35940.0 32506.0 1.8 4.0 170.0 23.0 30.0 3638.0 105.0 180.0
3 Audi A6 3.0 4dr Sedan Europe Front 36640.0 33129.0 3.0 6.0 220.0 20.0 27.0 3561.0 109.0 192.0
4 Audi A6 2.7 Turbo Quattro 4dr Sedan Europe All 42840.0 38840.0 2.7 6.0 250.0 18.0 25.0 3836.0 109.0 192.0

As you can see, getting CSV data into CAS from Python is pretty simple. Of course, if you have a large amount of data in CSV format, you may want to load it from a file on the server, but this method works well for small to medium size data.

To read data from an HTML file, we'll use the read_html method. We also have to add [0] to the end because read_html returns a CASTable object for each table in the HTML file; we just want the first one.


In [4]:
htmltbl = conn.read_html('https://www.fdic.gov/bank/individual/failed/banklist.html')[0]
htmltbl.head()


Out[4]:
Selected Rows from Table _T_8B0FB22B_7FFB1018E0C0
Bank Name City ST CERT Acquiring Institution Closing Date Updated Date Loss Share Type Agreement Terminated Termination Date
0 The Woodbury Banking Company Woodbury GA 11297 United Bank August 19, 2016 August 30, 2016 none
1 Hometown National Bank Longview WA 35156 Twin City Bank October 2, 2015 April 13, 2016 none
2 Doral BankEn Espanol San Juan PR 32102 Banco Popular de Puerto Rico February 27, 2015 May 13, 2015 none
3 Northern Star Bank Mankato MN 34983 BankVista December 19, 2014 January 6, 2016 none
4 GreenChoice Bank, fsb Chicago IL 28462 Providence Bank, LLC July 25, 2014 July 28, 2015 none

You'll notice that the 6th and 7th columns contain dates. We can use the parse_dates= parameter of read_html to convert those columns (zero-based indices 5 and 6) into CAS datetime values.


In [5]:
htmltbl = conn.read_html('https://www.fdic.gov/bank/individual/failed/banklist.html', 
                         parse_dates=[5, 6])[0]
htmltbl.head()


Out[5]:
Selected Rows from Table _T_8B0FC63D_7FFB1018E0C0
Bank Name City ST CERT Acquiring Institution Closing Date Updated Date Loss Share Type Agreement Terminated Termination Date
0 The Woodbury Banking Company Woodbury GA 11297 United Bank 2016-08-19 2016-08-30 none
1 Hometown National Bank Longview WA 35156 Twin City Bank 2015-10-02 2016-04-13 none
2 Doral BankEn Espanol San Juan PR 32102 Banco Popular de Puerto Rico 2015-02-27 2015-05-13 none
3 Northern Star Bank Mankato MN 34983 BankVista 2014-12-19 2016-01-06 none
4 GreenChoice Bank, fsb Chicago IL 28462 Providence Bank, LLC 2014-07-25 2015-07-28 none
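The strings in the Closing Date and Updated Date columns are now true datetimes. The conversion that parse_dates performs is equivalent to the following stdlib call for a single value (the format string here is an inference from the dates shown above, not anything SWAT-specific):

```python
from datetime import datetime

# parse_dates does this conversion for us; '%B %d, %Y' matches the
# "August 19, 2016" style of dates in the original HTML table
closing = datetime.strptime('August 19, 2016', '%B %d, %Y')
print(closing.date())  # 2016-08-19
```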

Here is the result of the table.columninfo action showing the data types of each column.


In [6]:
htmltbl.columninfo()


Out[6]:
§ ColumnInfo
Column ID Type RawLength FormattedLength Format NFL NFD
0 Bank Name 1 varchar 90 90 0 0
1 City 2 varchar 17 17 0 0
2 ST 3 varchar 2 2 0 0
3 CERT 4 int64 8 12 0 0
4 Acquiring Institution 5 varchar 65 65 0 0
5 Closing Date 6 datetime 8 16 DATETIME 0 0
6 Updated Date 7 datetime 8 16 DATETIME 0 0
7 Loss Share Type 8 varchar 7 7 0 0
8 Agreement Terminated 9 varchar 9 9 0 0
9 Termination Date 10 varchar 43 43 0 0

elapsed 0.00563s · user 0.003s · sys 0.015s · mem 0.766MB

All of the other read_XXX methods in Pandas work from CAS connection objects and support the same options. See the Pandas documentation for the other data reader methods and options.

Let's move on to the next method of loading data: the upload method.

Loading Data Using the upload Method

The next way of loading data is by sending a data file to the server using the upload method of the CAS object. In this style of loading data, the parsing of the file is done on the server side; the file is simply uploaded as binary data and the table.loadtable action is run in the background. The benefit is that the parsing can be done with much faster parsers than are available in Python. Some parsers can even execute in parallel, so the parsing load is split across nodes in the CAS grid. The downside is that there aren't as many parsing options as the Pandas data readers offer, so both methods still have their place. Let's look at code that uploads data files into CAS tables.

We'll start off with CSV files since they are ubiquitous.


In [7]:
out = conn.upload('https://raw.githubusercontent.com/sassoftware/sas-viya-programming/master/data/cars.csv')
out


NOTE: Cloud Analytic Services made the uploaded file available as table CARS in caslib CASUSERHDFS(kesmit).
NOTE: The table CARS has been created in caslib CASUSERHDFS(kesmit) from binary data uploaded to Cloud Analytic Services.
Out[7]:
§ caslib
CASUSERHDFS(kesmit)

§ tableName
CARS

§ casTable
CASTable('CARS', caslib='CASUSERHDFS(kesmit)')

elapsed 0.13s · user 0.253s · sys 0.226s · mem 259MB

When using the upload method on the CAS object, you'll get the result of the underlying table.upload action call. To get a reference to the CASTable object in that result, you just grab the casTable key.


In [8]:
tbl = out['casTable']
tbl.head()


Out[8]:
Selected Rows from Table CARS
Make Model Type Origin DriveTrain MSRP Invoice EngineSize Cylinders Horsepower MPG_City MPG_Highway Weight Wheelbase Length
0 Acura MDX SUV Asia All 36945.0 33337.0 3.5 6.0 265.0 17.0 23.0 4451.0 106.0 189.0
1 Acura RSX Type S 2dr Sedan Asia Front 23820.0 21761.0 2.0 4.0 200.0 24.0 31.0 2778.0 101.0 172.0
2 Acura TSX 4dr Sedan Asia Front 26990.0 24647.0 2.4 4.0 200.0 22.0 29.0 3230.0 105.0 183.0
3 Acura TL 4dr Sedan Asia Front 33195.0 30299.0 3.2 6.0 270.0 20.0 28.0 3575.0 108.0 186.0
4 Acura 3.5 RL 4dr Sedan Asia Front 43755.0 39014.0 3.5 6.0 225.0 18.0 24.0 3880.0 115.0 197.0

The upload method works with any file type that the table.loadtable action supports, but you can also upload a Pandas DataFrame. One thing to note about doing it this way is that the DataFrame data is exported to CSV and then uploaded to CAS, so the resulting table will follow the same rules and limitations as uploading a CSV file.


In [9]:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(50, 4), columns=list('ABCD'))
df.head()


Out[9]:
A B C D
0 0.702470 -0.649162 -0.666066 1.319843
1 1.285588 -0.326041 1.249939 0.110953
2 -0.212528 -0.180530 -0.586989 0.641180
3 -1.227102 2.422193 0.026077 -0.173956
4 -0.724691 0.011051 0.868402 0.512783

In [10]:
dftbl = conn.upload(df)['casTable']
dftbl


NOTE: Cloud Analytic Services made the uploaded file available as table TMP3I2DL8M2 in caslib CASUSERHDFS(kesmit).
NOTE: The table TMP3I2DL8M2 has been created in caslib CASUSERHDFS(kesmit) from binary data uploaded to Cloud Analytic Services.
Out[10]:
CASTable('TMP3I2DL8M2', caslib='CASUSERHDFS(kesmit)')

In [11]:
dftbl.head()


Out[11]:
Selected Rows from Table TMP3I2DL8M2
A B C D
0 0.702470 -0.649162 -0.666066 1.319843
1 1.285588 -0.326041 1.249939 0.110953
2 -0.212528 -0.180530 -0.586989 0.641180
3 -1.227102 2.422193 0.026077 -0.173956
4 -0.724691 0.011051 0.868402 0.512783

In [12]:
dftbl.columninfo()


Out[12]:
§ ColumnInfo
Column ID Type RawLength FormattedLength NFL NFD
0 A 1 double 8 12 0 0
1 B 2 double 8 12 0 0
2 C 3 double 8 12 0 0
3 D 4 double 8 12 0 0

elapsed 0.00537s · user 0.006s · sys 0.012s · mem 0.755MB
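All four columns came back as type double because the CSV round trip forces the server-side parser to re-infer column types from text. A minimal stdlib sketch of that serialization step (SWAT's actual implementation may differ in its details):

```python
import csv
import io

# Header row plus data rows, the way a DataFrame would be flattened to CSV
rows = [('A', 'B'), (0.702470, -0.649162), (1.285588, -0.326041)]

buf = io.StringIO()
csv.writer(buf).writerows(rows)
csv_text = buf.getvalue()

print(csv_text)
```

The server never sees the original numeric dtypes, only this text, which is why the types it reports are whatever its parser infers.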

Using Data Message Handlers

The last method of loading data into CAS from Python is using something called "data message handlers" and the addtable action. The addtable action is different from other actions in that it sets up two-way communication between the server and the client. Once the action is invoked, the server asks for chunks of data until there isn't any more data to upload. Implementing this communication can be tricky, so the SWAT package supplies some predefined data message handlers that can be used directly or subclassed to create your own.

You may not have realized it, but you have already used data message handlers in the first example. The read_XXX methods all use data message handlers to load the data into CAS. You can see all of the supplied data message handlers by looking at the subclasses of swat.cas.datamsghandlers.CASDataMsgHandler.


In [13]:
from swat.cas import datamsghandlers as dmh

print(dmh.CASDataMsgHandler.__subclasses__())
print(dmh.PandasDataFrame.__subclasses__())


[<class 'swat.cas.datamsghandlers.PandasDataFrame'>, <class 'swat.cas.datamsghandlers.DBAPI'>]
[<class 'swat.cas.datamsghandlers.JSON'>, <class 'swat.cas.datamsghandlers.CSV'>, <class 'swat.cas.datamsghandlers.SQLTable'>, <class 'swat.cas.datamsghandlers.Excel'>, <class 'swat.cas.datamsghandlers.FWF'>, <class 'swat.cas.datamsghandlers.SAS7BDAT'>, <class 'swat.cas.datamsghandlers.HTML'>, <class 'swat.cas.datamsghandlers.SQLQuery'>, <class 'swat.cas.datamsghandlers.Clipboard'>, <class 'swat.cas.datamsghandlers.Text'>]

You may notice that most of the data message handlers are just subclasses of the PandasDataFrame message handler. This is because they all convert their data into a Pandas DataFrame before uploading the data. This also means that the data must fit into memory to use them. There is one data message handler that doesn't work that way: swat.cas.datamsghandlers.DBAPI. The DBAPI message handler uses a Python DB-API 2.0 compliant database connection to retrieve data from a database and loads that into CAS. It only loads as much data as is needed for a single chunk. We'll just be looking at the CSV data message handler here. The rest follow the same pattern of usage.
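The chunked retrieval that the DBAPI handler relies on is just standard DB-API cursor behavior. Here is a self-contained sketch using sqlite3 (the table and data are made up, and this shows only the client-side chunking, not the CAS upload itself):

```python
import sqlite3

db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE cars (make TEXT, msrp REAL)')
db.executemany('INSERT INTO cars VALUES (?, ?)',
               [('Acura', 36945.0), ('Audi', 35940.0), ('BMW', 41000.0)])

cursor = db.execute('SELECT make, msrp FROM cars')

# Fetch rows in fixed-size chunks, the way a data message handler
# streams data to the server one chunk at a time
chunks = []
while True:
    rows = cursor.fetchmany(2)  # chunk size of 2 for illustration
    if not rows:
        break
    chunks.append(rows)

print(len(chunks))  # 2 chunks: one of 2 rows, one of 1 row
```

Because only one chunk is held in memory at a time, this style of loading scales to tables that are far larger than client memory.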

To use the CSV data message handler, you just pass it the path (or URL) of a CSV file. This creates an instance of the data message handler that can be used to generate the arguments to the addtable action and that also handles the requests from the server.


In [14]:
cvs_dmh = dmh.CSV('https://raw.githubusercontent.com/sassoftware/sas-viya-programming/master/data/cars.csv')
out = conn.addtable(table='cars', replace=True, **cvs_dmh.args.addtable)

The action call arguments may look a little odd if you haven't used Python much. The args.addtable property of cvs_dmh contains a dictionary of parameters for the addtable action. The double-asterisk before it tells Python to apply the dictionary as keyword arguments to the method call. The arguments generated by the data message handler are shown below.
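The double-asterisk syntax is plain Python keyword-argument unpacking, not anything SWAT-specific. A minimal illustration with a made-up function:

```python
def add_demo(table, replace=False, reclen=0):
    # Stand-in for an action call; it just echoes its arguments
    return (table, replace, reclen)

params = {'replace': True, 'reclen': 160}

# These two calls are equivalent
result = add_demo('cars', **params)
print(result)  # ('cars', True, 160)
```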


In [15]:
cvs_dmh.args.addtable


Out[15]:
{'datamsghandler': <swat.cas.datamsghandlers.CSV at 0x7f0173c9cd30>,
 'reclen': 160,
 'vars': [{'length': 16,
   'name': 'Make',
   'offset': 0,
   'rtype': 'CHAR',
   'type': 'VARCHAR'},
  {'length': 16,
   'name': 'Model',
   'offset': 16,
   'rtype': 'CHAR',
   'type': 'VARCHAR'},
  {'length': 16,
   'name': 'Type',
   'offset': 32,
   'rtype': 'CHAR',
   'type': 'VARCHAR'},
  {'length': 16,
   'name': 'Origin',
   'offset': 48,
   'rtype': 'CHAR',
   'type': 'VARCHAR'},
  {'length': 16,
   'name': 'DriveTrain',
   'offset': 64,
   'rtype': 'CHAR',
   'type': 'VARCHAR'},
  {'length': 8,
   'name': 'MSRP',
   'offset': 80,
   'rtype': 'NUMERIC',
   'type': 'SAS'},
  {'length': 8,
   'name': 'Invoice',
   'offset': 88,
   'rtype': 'NUMERIC',
   'type': 'SAS'},
  {'length': 8,
   'name': 'EngineSize',
   'offset': 96,
   'rtype': 'NUMERIC',
   'type': 'SAS'},
  {'length': 8,
   'name': 'Cylinders',
   'offset': 104,
   'rtype': 'NUMERIC',
   'type': 'SAS'},
  {'length': 8,
   'name': 'Horsepower',
   'offset': 112,
   'rtype': 'NUMERIC',
   'type': 'SAS'},
  {'length': 8,
   'name': 'MPG_City',
   'offset': 120,
   'rtype': 'NUMERIC',
   'type': 'SAS'},
  {'length': 8,
   'name': 'MPG_Highway',
   'offset': 128,
   'rtype': 'NUMERIC',
   'type': 'SAS'},
  {'length': 8,
   'name': 'Weight',
   'offset': 136,
   'rtype': 'NUMERIC',
   'type': 'SAS'},
  {'length': 8,
   'name': 'Wheelbase',
   'offset': 144,
   'rtype': 'NUMERIC',
   'type': 'SAS'},
  {'length': 8,
   'name': 'Length',
   'offset': 152,
   'rtype': 'NUMERIC',
   'type': 'SAS'}]}

The output of the addtable action includes a casTable key just like the upload method.


In [16]:
tbl = out['casTable']
tbl.head()


Out[16]:
Selected Rows from Table CARS
Make Model Type Origin DriveTrain MSRP Invoice EngineSize Cylinders Horsepower MPG_City MPG_Highway Weight Wheelbase Length
0 Acura MDX SUV Asia All 36945.0 33337.0 3.5 6.0 265.0 17.0 23.0 4451.0 106.0 189.0
1 Acura 3.5 RL 4dr Sedan Asia Front 43755.0 39014.0 3.5 6.0 225.0 18.0 24.0 3880.0 115.0 197.0
2 Audi A41.8T convertible 2dr Sedan Europe Front 35940.0 32506.0 1.8 4.0 170.0 23.0 30.0 3638.0 105.0 180.0
3 Audi A6 3.0 4dr Sedan Europe Front 36640.0 33129.0 3.0 6.0 220.0 20.0 27.0 3561.0 109.0 192.0
4 Audi A6 2.7 Turbo Quattro 4dr Sedan Europe All 42840.0 38840.0 2.7 6.0 250.0 18.0 25.0 3836.0 109.0 192.0

Just as with the read_XXX methods on the connection, the PandasDataFrame-based data message handlers take the same arguments as the corresponding read_XXX methods to customize their behavior. Using Pandas' HTML reader as an example again, we can upload data containing dates. Note that in this case, we specify the index of the HTML table that we want in the index= parameter.


In [17]:
html_dmh = dmh.HTML('https://www.fdic.gov/bank/individual/failed/banklist.html', index=0, parse_dates=[5, 6])
out = conn.addtable(table='banklist', replace=True, **html_dmh.args.addtable)
tbl = out.casTable
tbl.head()


Out[17]:
Selected Rows from Table BANKLIST
Bank Name City ST CERT Acquiring Institution Closing Date Updated Date Loss Share Type Agreement Terminated Termination Date
0 The Woodbury Banking Company Woodbury GA 11297 United Bank 2016-08-19 2016-08-30 none
1 Hometown National Bank Longview WA 35156 Twin City Bank 2015-10-02 2016-04-13 none
2 Doral BankEn Espanol San Juan PR 32102 Banco Popular de Puerto Rico 2015-02-27 2015-05-13 none
3 Northern Star Bank Mankato MN 34983 BankVista 2014-12-19 2016-01-06 none
4 GreenChoice Bank, fsb Chicago IL 28462 Providence Bank, LLC 2014-07-25 2015-07-28 none

Conclusion

In this article, we have demonstrated three ways of getting data from Python into CAS. The first was using the read_XXX methods that emulate Pandas' data readers. The second was using the upload method of the CAS connection to upload a file and have the server parse it. The third was using data message handlers to add data with the addtable action. Each method has its strengths and weaknesses. The upload method is generally the fastest, but has the most limited parsing options. The read_XXX methods are quick and convenient and give you the full power of the Pandas data readers. And while we didn't go deep into data message handlers here, they offer the most flexibility, since you can subclass them to write custom data loaders for any data source.


In [18]:
conn.close()
